Spatial Ecology and Macroecology

Practical - Week 1

Florencia Grattarola

(Department of Spatial Sciences)

2022-09-26

What are we going to see today?

  1. Data types
  2. Data sources
  3. Open data
    • Data standards
    • Licensing
    • Data sharing
  4. Data download through R
  5. Data quality

1. Data types

Data that can place a particular species in a particular place and time can take many forms.

  • Occurrence records: primary data – specimens from an herbarium or citizen-science observations.
  • Sampling events: effort is recorded – tracks, quadrats, camera traps, DNA sampling.
  • Checklists: list of taxa – species associated with a location.
  • Range-maps: expert-based – maps from field guides.
  • Atlases: info on distribution, abundance, long-term change – birds from the Czech Republic.

Data can also be classified by how they were collected.

  • Structured: standardized sampling protocol, site selection – sometimes stratified random.
  • Semi-structured: no standardized sampling protocol, site selection – free, but metadata associated with the data describe the survey methods.
  • Unstructured (opportunistic): no standardized sampling protocol, site selection – free, little metadata.

Finally, data can also be classified by how they are made available to others.

  • Disaggregated: precision is high, but completeness and representativeness are low.
  • Aggregated: precision is low, but completeness and representativeness are high.

While disaggregated data can produce reliable results for a limited set of well-covered regions, aggregated data types can provide critical information for the extrapolation of biodiversity patterns into less well-sampled regions.

2. Data sources

gbif.org

GBIF is an international network and data infrastructure funded by the world’s governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth.

rgbif: https://github.com/ropensci/rgbif

obis.org

OBIS is a global open-access data and information clearing-house on marine biodiversity for science, conservation and sustainable development.

robis: https://github.com/iobis/robis

ebird.org

eBird’s goal is to gather birdwatchers’ knowledge and experience in the form of checklists of birds, archive it, and freely share it to power new data-driven approaches to science, conservation and education.

auk: https://cornelllabofornithology.github.io/auk/

rebird: https://github.com/ropensci/rebird

inaturalist.org

iNaturalist is one of the world’s most popular nature apps. It allows participants to contribute observations of any organism, or traces thereof, along with associated spatio-temporal metadata.

rinat: https://github.com/ropensci/rinat

mol.org

Map of Life endeavors to provide ‘best-possible’ species range information and species lists for any geographic area. The Map of Life assembles and integrates different sources of data describing species distributions worldwide.

iucnredlist.org

IUCN’s (International Union for Conservation of Nature) Red List of Threatened Species has evolved to become the world’s most comprehensive information source on the global extinction risk status of animal, fungus and plant species.

rredlist: https://github.com/ropensci/rredlist

redlistr: https://github.com/red-list-ecosystem/redlistr

bien.nceas.ucsb.edu/bien/

BIEN is a network of ecologists, botanists, and computer scientists working together to document global patterns of plant diversity, function and distribution.

rbien: https://github.com/bmaitner/RBIEN

sibbr.gov.br

SiBBr (Brazilian Biodiversity Information System) is an online platform that integrates data and information about biodiversity and ecosystems from different sources, making them accessible for different uses.

sibbr: https://github.com/sibbr

bto.org/our-science/projects/breeding-bird-survey

BBS (Breeding Bird Survey) involves thousands of volunteer birdwatchers carrying out standardised annual bird counts on randomly located 1-km squares. It’s part of the NBN Atlas.

ala.org.au

ALA (Atlas of Living Australia) is a collaborative, digital, open infrastructure that pulls together Australian biodiversity data from multiple sources, making it accessible and reusable.

galah: https://galah.ala.org.au

living-atlases.gbif.org

The open community around the Atlas of Living Australia platform.

biotime.st-andrews.ac.uk

BioTIME is an open-access global database of assemblage time series for quantifying and understanding biodiversity change.

BioTime Hub: https://github.com/bioTIMEHub

nhm.ac.uk/our-science/our-work/biodiversity/predicts

PREDICTS uses data on local biodiversity around the world to model how human activities affect biological communities. This biodiversity change is shown as the Biodiversity Intactness Index (BII).

EXERCISE

Explore different data sources and find out:

  • What type of data do they collate?
  • Which taxa do they cover?
  • What types of data formats are available?
  • Can everyone download the data? Are there any restrictions?
  • Which type of licences do their data have?

Pick only one data source.

3. Open Data

Open means anyone can freely access, use, modify, and share for any purpose.


3. Open Data: Data standards

Darwin Core is the internationally agreed data standard to facilitate the sharing of information about biological diversity.

dwc.tdwg.org

countryCode: The standard code for the country in which the Location occurs. Recommended best practice is to use an ISO 3166-1-alpha-2 country code.
recordedBy: A list (concatenated and separated) of names of people, groups, or organizations responsible for recording the original Occurrence.
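
As a small illustration of what these two terms look like in practice, the sketch below builds a toy table whose columns use Darwin Core term names and checks the format of countryCode. The records are invented, and the ' | ' separator for recordedBy is assumed here because it is the one Darwin Core recommends; real datasets may use others.

```r
# Toy occurrence table using Darwin Core term names as column names.
# All records are invented for illustration.
occurrences <- data.frame(
  scientificName = c("Lynx lynx", "Lutra lutra", "Castor fiber"),
  countryCode    = c("CZ", "cz", "Czech Republic"),
  recordedBy     = c("A. Novak | B. Svoboda", "C. Dvorak", "D. Cerny"),
  stringsAsFactors = FALSE
)

# An ISO 3166-1 alpha-2 code is exactly two upper-case letters.
# This checks the format only, not whether the code is actually assigned.
occurrences$validCountryCode <- grepl("^[A-Z]{2}$", occurrences$countryCode)

# recordedBy holds several names in one string; split on the assumed separator.
recorders <- strsplit(occurrences$recordedBy, " | ", fixed = TRUE)

occurrences[, c("countryCode", "validCountryCode")]
```

Only the first record passes the format check; the other two would need harmonising before the data are shared.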

3. Open Data: Licensing

Open data are licensed under open licenses. Some examples:


CC0: Public domain


CC-BY: Attribution


CC-BY-NC: Attribution - Non Commercial


CC-BY-SA: Attribution - Share Alike

3. Open Data: Data sharing

Data that are standardized and have an open licence can be shared :)

PRACTICE Mammals of the Czech Republic

As an example, we will use the mammals of the Czech Republic. We will access the data through GBIF.

Some preparation before starting to code

  • Create a new project for all your practical sessions.

File > New project > New directory or Existing directory

  • Install the package tidyverse.
install.packages('tidyverse') # install
library(tidyverse) # load


We will be using many functions from this package, like filter(), mutate(), and later read_csv().

4. Data download through R

We will use rgbif.

First, we’ll need to install the package.

install.packages('rgbif')

To use it, we load the library and check it’s working.

library(rgbif)
packageVersion('rgbif')
[1] '3.7.3'

We will need the GBIF backbone taxon ID (taxonKey) for the Mammalia class. For that we will use another package called taxize.

install.packages('taxize')
library(taxize)
packageVersion('taxize')
[1] '0.9.100'

4. Data download through R

So, let’s get the taxon ID for the Mammalia class

get_gbifid_('Mammalia') 
$Mammalia
  usagekey                         scientificname   rank   status matchtype
1      359                               Mammalia  class ACCEPTED     EXACT
2  7423517 Mammamia Akkari, Stoev & Enghoff, 2011  genus ACCEPTED     FUZZY
3  9522622                               Mammaria  genus ACCEPTED     FUZZY
4  7688954    Mammaria Cesati ex Rabenhorst, 1854  genus ACCEPTED     FUZZY
5  2573090                    Mammaria Oken, 1815  genus  SYNONYM     FUZZY
6  6008010                  Mammaria Müller, 1776  genus DOUBTFUL     FUZZY
7  4899044                       Mammalian Prions family ACCEPTED     FUZZY
  canonicalname confidence   kingdom     phylum kingdomkey phylumkey classkey
1      Mammalia         94  Animalia   Chordata          1        44      359
2      Mammamia         74  Animalia Arthropoda          1        54      361
3      Mammaria         74     Fungi Ascomycota          5        95      320
4      Mammaria         74 Chromista    Myzozoa          4   8770992       NA
5      Mammaria         73 Chromista    Myzozoa          4   8770992  9049014
6      Mammaria         68  Animalia       <NA>          1        NA       NA
7     Mammalian         64   Viruses       <NA>          8        NA       NA
  synonym           class
1   FALSE        Mammalia
2   FALSE       Diplopoda
3   FALSE Sordariomycetes
4   FALSE            <NA>
5    TRUE     Dinophyceae
6   FALSE            <NA>
7   FALSE            <NA>
                                                                               note
1                                                                              <NA>
2  Similarity: name=75; authorship=0; classification=-2; rank=0; status=1; score=74
3  Similarity: name=75; authorship=0; classification=-2; rank=0; status=1; score=74
4  Similarity: name=75; authorship=0; classification=-2; rank=0; status=1; score=74
5  Similarity: name=75; authorship=0; classification=-2; rank=0; status=0; score=73
6 Similarity: name=75; authorship=0; classification=-2; rank=0; status=-5; score=68
7 Similarity: name=75; authorship=0; classification=-12; rank=0; status=1; score=64
         order            family     genus orderkey familykey genuskey
1         <NA>              <NA>      <NA>       NA        NA       NA
2       Julida           Julidae  Mammamia     1019      4012  7423517
3  Sordariales Lasiosphaeriaceae  Mammaria     1061      4162  9522622
4         <NA>              <NA>  Mammaria       NA        NA  7688954
5 Noctilucales     Noctilucaceae Noctiluca  8808938   8267551  7443358
6         <NA>              <NA>  Mammaria       NA        NA  6008010
7         <NA>         Mammalian      <NA>       NA   4899044       NA
  acceptedusagekey
1               NA
2               NA
3               NA
4               NA
5          7443358
6               NA
7               NA

4. Data download through R

The result is a list with several candidate matches, so we keep only the exact, accepted match and pull out its key:

mammaliaTaxonKey <- get_gbifid_('Mammalia') %>% bind_rows() %>% 
    filter(matchtype == 'EXACT' & status == 'ACCEPTED') %>%
    pull(usagekey)
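
The filter-and-pull pattern can be tried offline on a cut-down copy of the match table. The sketch below copies three rows from the output shown earlier (only four of the columns) and adds a guard, which is an addition for illustration and not part of the original workflow, to stop if the filter returns zero or several keys.

```r
library(dplyr)

# A cut-down copy of three rows from the get_gbifid_('Mammalia') output above.
matches <- tibble(
  usagekey       = c(359, 7423517, 2573090),
  scientificname = c("Mammalia",
                     "Mammamia Akkari, Stoev & Enghoff, 2011",
                     "Mammaria Oken, 1815"),
  status         = c("ACCEPTED", "ACCEPTED", "SYNONYM"),
  matchtype      = c("EXACT", "FUZZY", "FUZZY")
)

# Keep the exact, accepted match and extract its key as a plain vector.
key <- matches %>%
  filter(matchtype == "EXACT" & status == "ACCEPTED") %>%
  pull(usagekey)

# Guard: a query should only proceed with exactly one key.
stopifnot(length(key) == 1)
key
```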

4. Data download through R

And now we can use the function occ_count() to count occurrence records. Its arguments and their default values, as shown in the documentation of the rgbif version used here, are:

occ_count(
  taxonKey = NULL,
  georeferenced = NULL,
  basisOfRecord = NULL,
  datasetKey = NULL,
  date = NULL,
  typeStatus = NULL,
  country = NULL,
  year = NULL,
  from = 2000,
  to = 2012,
  type = "count",
  publishingCountry = "US",
  protocol = NULL,
  curlopts = list()
)

4. Data download through R

How many occurrence records are in GBIF for the entire Czech Republic?

occ_count(country='CZ') # country code for Czech Republic (https://countrycode.org/)
[1] 3423706


And how many records are there for the mammals of the Czech Republic?

occ_count(country='CZ', 
          taxonKey=mammaliaTaxonKey) 
[1] 6345


We are ready to do a download. Whoop!

4. Data download through R

To do this, we will use occ_search().

occ_search(
  taxonKey = NULL,
  scientificName = NULL,
  country = NULL,
  publishingCountry = NULL,
  hasCoordinate = NULL,
  typeStatus = NULL,
  recordNumber = NULL,
  lastInterpreted = NULL,
  continent = NULL,
  geometry = NULL,
  geom_big = "asis",
  geom_size = 40,
  geom_n = 10,
  recordedBy = NULL,
  recordedByID = NULL,
  identifiedByID = NULL,
  basisOfRecord = NULL,
  datasetKey = NULL,
  eventDate = NULL,
  catalogNumber = NULL,
  year = NULL,
  month = NULL,
  decimalLatitude = NULL,
  decimalLongitude = NULL,
  elevation = NULL,
  depth = NULL,
  institutionCode = NULL,
  collectionCode = NULL,
  hasGeospatialIssue = NULL,
  issue = NULL,
  search = NULL,
  mediaType = NULL,
  subgenusKey = NULL,
  repatriated = NULL,
  phylumKey = NULL,
  kingdomKey = NULL,
  classKey = NULL,
  orderKey = NULL,
  familyKey = NULL,
  genusKey = NULL,
  establishmentMeans = NULL,
  protocol = NULL,
  license = NULL,
  organismId = NULL,
  publishingOrg = NULL,
  stateProvince = NULL,
  waterBody = NULL,
  locality = NULL,
  limit = 500,
  start = 0,
  fields = "all",
  return = NULL,
  facet = NULL,
  facetMincount = NULL,
  facetMultiselect = NULL,
  skip_validate = TRUE,
  curlopts = list(),
  ...
)

4. Data download through R

Get occurrence records of mammals from the Czech Republic.

occ_search(taxonKey=mammaliaTaxonKey,
           country='CZ') 
Records found [6345] 
Records returned [500] 
No. unique hierarchies [39] 
No. media records [500] 
No. facets [0] 
Args [occurrenceStatus=PRESENT, limit=500, offset=0, taxonKey=359, country=CZ,
     fields=all] 
# A tibble: 500 × 98
   key    scien…¹ decim…² decim…³ issues datas…⁴ publi…⁵ insta…⁶ hosti…⁷ publi…⁸
   <chr>  <chr>     <dbl>   <dbl> <chr>  <chr>   <chr>   <chr>   <chr>   <chr>  
 1 40115… Dama d…    49.2    16.5 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 2 40116… Castor…    50.2    14.6 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 3 40150… Myocas…    49.7    15.1 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 4 40181… Myocas…    50.1    14.4 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 5 40149… Sus sc…    49.2    16.5 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 6 40149… Dama d…    49.2    16.5 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 7 40149… Capreo…    49.6    16.7 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 8 40149… Lepus …    49.6    16.7 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 9 40149… Myocas…    50.1    14.6 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
10 40148… Myocas…    49.8    14.7 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
# … with 490 more rows, 88 more variables: protocol <chr>, lastCrawled <chr>,
#   lastParsed <chr>, crawlId <int>, basisOfRecord <chr>,
#   occurrenceStatus <chr>, taxonKey <int>, kingdomKey <int>, phylumKey <int>,
#   classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
#   speciesKey <int>, acceptedTaxonKey <int>, acceptedScientificName <chr>,
#   kingdom <chr>, phylum <chr>, order <chr>, family <chr>, genus <chr>,
#   species <chr>, genericName <chr>, specificEpithet <chr>, taxonRank <chr>, …

Check the data output. What’s the format? How many rows does it have?

4. Data download through R

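Get more occurrence records of mammals from the Czech Republic. By default, occ_search() returns only 500 records, so we raise limit. Note that limit=6000 is still below the 6,345 records found, so 345 records are not retrieved; to get them all, set limit to at least the number of records found.

occ_search(taxonKey=mammaliaTaxonKey,
           country='CZ',
           limit=6000) 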


Finally, we store the result in the object mammalsCZ.

mammalsCZ <- occ_search(taxonKey=mammaliaTaxonKey,
           country='CZ',
           limit=6000) 

mammalsCZ <- mammalsCZ$data
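
For complete or larger downloads, rgbif also offers an asynchronous download route through occ_download(). The following is only a sketch under assumptions: it presumes a GBIF account whose credentials are stored in the GBIF_USER, GBIF_PWD and GBIF_EMAIL environment variables, and the wrapper name download_mammals_cz is hypothetical. The function is defined but not called, because running it needs the rgbif package, network access and those credentials.

```r
# Sketch only: defined but not called here.
# download_mammals_cz is a hypothetical helper name, not part of rgbif.
download_mammals_cz <- function() {
  # Ask GBIF to prepare a download of all Mammalia records from the Czech Republic.
  request <- rgbif::occ_download(
    rgbif::pred("taxonKey", 359),   # Mammalia key found earlier
    rgbif::pred("country", "CZ"),
    format = "SIMPLE_CSV"
  )
  # Block until GBIF has finished preparing the file.
  rgbif::occ_download_wait(request)
  # Fetch the zip file into the working directory and read it in.
  zip_file <- rgbif::occ_download_get(request)
  rgbif::occ_download_import(zip_file)
}
```

A download made this way also comes with a DOI, which makes the dataset easier to cite.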

4. Data download through R

Mammal occurrence records from the Czech Republic

mammalsCZ
# A tibble: 6,000 × 179
   key    scien…¹ decim…² decim…³ issues datas…⁴ publi…⁵ insta…⁶ hosti…⁷ publi…⁸
   <chr>  <chr>     <dbl>   <dbl> <chr>  <chr>   <chr>   <chr>   <chr>   <chr>  
 1 40115… Dama d…    49.2    16.5 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 2 40116… Castor…    50.2    14.6 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 3 40150… Myocas…    49.7    15.1 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 4 40181… Myocas…    50.1    14.4 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 5 40149… Sus sc…    49.2    16.5 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 6 40149… Dama d…    49.2    16.5 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 7 40149… Capreo…    49.6    16.7 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 8 40149… Lepus …    49.6    16.7 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 9 40149… Myocas…    50.1    14.6 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
10 40148… Myocas…    49.8    14.7 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
# … with 5,990 more rows, 169 more variables: protocol <chr>,
#   lastCrawled <chr>, lastParsed <chr>, crawlId <int>, basisOfRecord <chr>,
#   occurrenceStatus <chr>, taxonKey <int>, kingdomKey <int>, phylumKey <int>,
#   classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
#   speciesKey <int>, acceptedTaxonKey <int>, acceptedScientificName <chr>,
#   kingdom <chr>, phylum <chr>, order <chr>, family <chr>, genus <chr>,
#   species <chr>, genericName <chr>, specificEpithet <chr>, taxonRank <chr>, …

4. Data download through R

Mammal occurrence records from the Czech Republic

How many records do we have?

nrow(mammalsCZ)
[1] 6000


How many species do we have?

mammalsCZ %>% 
  filter(taxonRank=='SPECIES') %>% 
  distinct(scientificName) %>% nrow()
[1] 135

distinct() is used to see unique values
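
To see what the species count is doing, here is a toy version with invented records. It also shows n_distinct(), a dplyr function that gives the same number in one step; using it is an alternative suggested here, not something the practical requires.

```r
library(dplyr)

# Invented records: one species appears twice, and one record is only at genus level.
records <- tibble(
  scientificName = c("Lynx lynx (Linnaeus, 1758)", "Lynx lynx (Linnaeus, 1758)",
                     "Lutra lutra (Linnaeus, 1758)", "Myotis Kaup, 1829"),
  taxonRank      = c("SPECIES", "SPECIES", "SPECIES", "GENUS")
)

# Same pattern as above: keep species-level rows, drop duplicate names, count rows.
n_species <- records %>%
  filter(taxonRank == "SPECIES") %>%
  distinct(scientificName) %>%
  nrow()

# n_distinct() counts the unique values directly.
n_species_short <- records %>%
  filter(taxonRank == "SPECIES") %>%
  summarise(n = n_distinct(scientificName)) %>%
  pull(n)
```

Both approaches count two species in this toy table.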

5. Data quality

Data are not inherently ‘good’ or ‘bad’; their quality depends on our goal.
Some things we can check:

  • Basis of the record (type of occurrence)
  • Species names (taxonomic harmonisation)
  • Spatial and temporal (accuracy / precision)

CoordinateCleaner: https://github.com/ropensci/CoordinateCleaner

Automated flagging of common spatial and temporal errors in data.
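
To give a feel for the kind of spatial errors that get flagged, the sketch below hand-codes three simple checks in base R on invented coordinates. It is not the CoordinateCleaner API, and the package runs many more tests than these.

```r
# Invented coordinates: one plausible, one at (0, 0), one impossible, one missing.
records <- data.frame(
  decimalLatitude  = c(49.2, 0, 95.0, NA),
  decimalLongitude = c(16.5, 0, 14.4, 15.1)
)

lat <- records$decimalLatitude
lon <- records$decimalLongitude

# Three simple flags; the !missing guard keeps NA out of the later comparisons.
missing      <- is.na(lat) | is.na(lon)
out_of_range <- !missing & (abs(lat) > 90 | abs(lon) > 180)
zero_zero    <- !missing & lat == 0 & lon == 0

# A record is flagged if it fails any of the checks.
records$flagged <- missing | out_of_range | zero_zero
records
```

Only the first record passes all three checks. Flagging instead of deleting keeps the decision about what to drop in our hands.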

5. Data quality

As an example, we will check the following fields:

  • basisOfRecord: we want preserved specimens or observations
  • taxonRank: we want records at species level.
  • coordinateUncertaintyInMeters: we want it to be smaller than 10 km.

5. Data quality

  • basisOfRecord: we want preserved specimens or observations
mammalsCZ %>% distinct(basisOfRecord)
# A tibble: 7 × 1
  basisOfRecord     
  <chr>             
1 HUMAN_OBSERVATION 
2 OBSERVATION       
3 MATERIAL_SAMPLE   
4 PRESERVED_SPECIMEN
5 OCCURRENCE        
6 FOSSIL_SPECIMEN   
7 MATERIAL_CITATION 

distinct() is used to see unique values

5. Data quality

  • basisOfRecord: we want preserved specimens or observations
mammalsCZ %>% group_by(basisOfRecord) %>% count()
# A tibble: 7 × 2
# Groups:   basisOfRecord [7]
  basisOfRecord          n
  <chr>              <int>
1 FOSSIL_SPECIMEN      105
2 HUMAN_OBSERVATION   4754
3 MATERIAL_CITATION     14
4 MATERIAL_SAMPLE      128
5 OBSERVATION           75
6 OCCURRENCE             2
7 PRESERVED_SPECIMEN   922

group_by() is used to group values within a variable
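
As a side note, dplyr's count() can take the grouping variable directly, which gives the same table without an explicit group_by(). A toy comparison with invented records:

```r
library(dplyr)

# Invented records with three kinds of basisOfRecord.
records <- tibble(
  basisOfRecord = c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN",
                    "HUMAN_OBSERVATION", "FOSSIL_SPECIMEN")
)

# The pattern used above, ungrouped afterwards for a fair comparison.
by_group <- records %>% group_by(basisOfRecord) %>% count() %>% ungroup()

# The shortcut: count() groups, counts and ungroups in one call.
shortcut <- records %>% count(basisOfRecord)
```

Either form is fine; the shortcut simply returns an ungrouped table.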

5. Data quality

  • basisOfRecord: we want preserved specimens or observations
mammalsCZ <- mammalsCZ %>% 
  filter(basisOfRecord=='PRESERVED_SPECIMEN' |
           basisOfRecord=='HUMAN_OBSERVATION')

Note the use of | (OR) to filter the data.
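
When there are more than two values to keep, chaining | quickly gets long. A common alternative is %in%, sketched here on invented records; the vector name wanted is just for illustration.

```r
library(dplyr)

# Invented records with four kinds of basisOfRecord.
records <- tibble(
  basisOfRecord = c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN",
                    "FOSSIL_SPECIMEN", "MATERIAL_SAMPLE")
)

# Values we want to keep (hypothetical helper vector).
wanted <- c("PRESERVED_SPECIMEN", "HUMAN_OBSERVATION")

# Filtering with OR, as in the practical.
with_or <- records %>%
  filter(basisOfRecord == "PRESERVED_SPECIMEN" |
           basisOfRecord == "HUMAN_OBSERVATION")

# Filtering with %in%: keeps rows whose value appears in the vector.
with_in <- records %>% filter(basisOfRecord %in% wanted)
```

Both filters keep the same two rows.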


How many records do we have now?

nrow(mammalsCZ)
[1] 5676

5. Data quality

  • taxonRank: we want records at species level
mammalsCZ %>% distinct(taxonRank)
# A tibble: 5 × 1
  taxonRank 
  <chr>     
1 SPECIES   
2 SUBSPECIES
3 GENUS     
4 ORDER     
5 FAMILY    

5. Data quality

  • taxonRank: we want records at species level
mammalsCZ <- mammalsCZ %>% 
  filter(taxonRank == 'SPECIES')


How many records do we have now?

nrow(mammalsCZ)
[1] 5294

5. Data quality

  • coordinateUncertaintyInMeters: we want it to be smaller than 10 km
# First, inspect the records whose uncertainty exceeds 1,000 m (1 km)
mammalsCZ %>% 
  filter(coordinateUncertaintyInMeters > 1000) %>% 
  select(scientificName, coordinateUncertaintyInMeters, stateProvince)
# A tibble: 505 × 3
   scientificName                             coordinateUncertaintyInM…¹ state…²
   <chr>                                                           <dbl> <chr>  
 1 Myotis nattereri (Kuhl, 1817)                                   26454 Středo…
 2 Myotis myotis (Borkhausen, 1797)                                26454 Středo…
 3 Myotis myotis (Borkhausen, 1797)                                26454 Středo…
 4 Myotis myotis (Borkhausen, 1797)                                26454 Středo…
 5 Rhinolophus hipposideros (Bechstein, 1800)                      26454 Středo…
 6 Rhinolophus hipposideros (Bechstein, 1800)                      26454 Středo…
 7 Myotis myotis (Borkhausen, 1797)                                26454 Středo…
 8 Barbastella barbastellus (Schreber, 1774)                       26454 Středo…
 9 Barbastella barbastellus (Schreber, 1774)                       26454 Středo…
10 Plecotus auritus (Linnaeus, 1758)                               26454 Středo…
# … with 495 more rows, and abbreviated variable names
#   ¹​coordinateUncertaintyInMeters, ²​stateProvince

5. Data quality

  • coordinateUncertaintyInMeters: we want it to be smaller than 10 km
mammalsCZ <- mammalsCZ %>% 
  filter(coordinateUncertaintyInMeters < 10000) # keep records below 10 km; rows with NA uncertainty are dropped too


How many records do we have now?

nrow(mammalsCZ)
[1] 3894
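
Part of this drop comes from records with no uncertainty value at all: filter() only keeps rows where the condition is TRUE, so rows where it evaluates to NA are removed. The toy example below, with invented values, contrasts that behaviour with a more lenient filter that keeps records with missing uncertainty. Which one is appropriate depends on the goal.

```r
library(dplyr)

# Invented records: one has no uncertainty recorded.
records <- tibble(
  scientificName = c("Lynx lynx", "Lutra lutra", "Castor fiber", "Myotis myotis"),
  coordinateUncertaintyInMeters = c(25, 26454, NA, 9999)
)

# Strict: the NA row is dropped along with the record above 10 km.
strict <- records %>%
  filter(coordinateUncertaintyInMeters < 10000)

# Lenient: explicitly keep rows with missing uncertainty.
lenient <- records %>%
  filter(is.na(coordinateUncertaintyInMeters) |
           coordinateUncertaintyInMeters < 10000)
```

Here the strict filter keeps two records and the lenient one keeps three.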

How are the records distributed?

We’ll get to this next week :)

Any doubts?